B.  INTRODUCTION

Interval-censored  (IC)  data  are  encountered  in  three  areas  of  breast  cancer  research. 
The  most  common  application  is  in  clinical  relapse  follow-up  studies  in  which  the  study 
endpoint  is  disease-free  survival.  When  a  patient  relapses,  it  is  usually  known  that  the 
relapse  takes  place  between  two  follow-up  visits,  and  the  exact  time  to  relapse  is  unknown. 
In  statistics,  we  say  relapse  time  is  interval  censored.  Interval  censoring  is  also  encountered 
in  breast  cancer  registry  studies  in  which  information  on  family  history  of  cancer  is  updated 
periodically.  The  Strang  Breast  Surveillance  Program  for  women  at  increased  risk  for  breast 
cancer,  for  instance,  has  enlisted  over  800  women  with  complete  pedigree  information  which 
is  verified  and  updated  continuously.  Family  history  data  such  as  age  at  diagnosis  of  a 
specific  cancer,  or  a  benign  but  risk-conferring  condition,  are  obtained  from  each  registrant 
at  each  update.  Time  to  a  cancer  event,  and  definitely  time  to  first  detection  of  a  benign 
condition,  are  at  best  known  to  fall  in  the  time  interval  between  the  last  update  and  age 
at  diagnosis.  A  third  but  increasingly  important  area  of  application  of  interval  censoring 
is  in  breast  cancer  chemoprevention  experiments  or  prevention  trials,  which  involve  the 
observation  of  one  or  more  surrogate  endpoint  biomarkers  (SEB)  over  time.  The  scientific 
question  of  interest  here  is  the  estimation  of  time  for  the  SEB  to  reach  a  target  value,  and 
time  from  cessation  of  intake  of  a  chemopreventive  agent  to  the  loss  of  its  protective  effect. 
Unfortunately,  the  exact  values  of  both  these  time  variables  are  known  only  to  lie  in  between 
two  successive  assay  inspection  times. 

Let $X$ denote a time-to-event variable with distribution function $F(x) = \Pr(X \le x)$, or equivalently, survival function $S(x) = 1 - F(x)$. Under interval censoring, $X$ is not observed and is known only to lie in an observable interval $(L, R)$. In our previous DOD-funded grant, we made fundamental contributions to both the theory of generalized maximum likelihood (GML) estimation of $S$ and the computation required for inference based on the GML estimator (GMLE) $\hat{S}$ of $S$. These contributions are restricted to the case of univariate interval-censored data.

Multivariate interval censoring involves $d \ge 2$ correlated variables $X_1, \ldots, X_d$, each of which is subject to interval censoring. The main statistical concern here is the GML estimation of the joint survival function $S(x_1, \ldots, x_d) = \Pr(X_1 > x_1, \ldots, X_d > x_d)$ and of the correlations among the variables. Our interest in multivariate IC data is driven by needs arising
from  two  related  areas  of  breast  cancer  research  at  Strang.  First,  our  investigators  in  the 
Strang  Cancer  Genetics  Program  want  to  study  various  patterns  of  familial  aggregation  of 
breast,  ovarian  and  other  forms  of  cancer  using  family  history  data  from  the  Strang  Breast 
Surveillance  Program.  Studies  of  familial  early  onset  of  breast  cancer,  breast-ovarian  and 
breast-prostate associations will lead to multivariate IC data of high dimensions; therefore, a proper statistical procedure, together with feasible software to handle such data, is very much needed. Second, we are conducting a one-year chemoprevention trial of indole-3-carbinol (I3C) for breast cancer prevention. In this prevention trial we are monitoring the levels of two SEBs, a urinary estrogen metabolite ratio and a blood counterpart, both of which are subject to interval censoring. An earlier dose-ranging study of I3C conducted by Wong et al. [1] has been published.

Statistical analysis of multivariate IC data has never been attempted. In the multivariate situation, modeling the intercorrelated time-to-event variables and their dependency structure will require a great deal of innovative thinking; moreover, GML computation with realistic sample sizes can be prohibitively difficult.

The overall aim of this research proposal is to develop statistical inference for multivariate interval-censored data that are encountered in breast cancer chemoprevention trials employing multiple surrogate endpoint biomarkers, and in breast cancer registry follow-up studies of familial aggregation of breast and other forms of cancer. Asymptotic generalized maximum likelihood theory will be investigated, and a computer software package for maximum likelihood inference and Kaplan-Meier type survival plots will be implemented.


C.  BODY 

Consider nonparametric estimation of the joint survival function $S(x_1, \ldots, x_d) = \Pr(X_1 > x_1, \ldots, X_d > x_d)$ of $d \ge 2$ intercorrelated time-to-event variables $X_1, \ldots, X_d$, each of which is subject to interval censoring. For ease of presentation and without loss of generality, we restrict our discussion to the bivariate case $X = (X_1, X_2)$.

Let $(U_i, V_i)$ denote two consecutive follow-up times corresponding to $X_i$, and let $(L_i, R_i)$ denote the observable interval-censored (IC) data for $X_i$, defined as

$$
(L_i, R_i) =
\begin{cases}
(0, U_i) & \text{if } X_i \le U_i, \\
(U_i, V_i) & \text{if } U_i < X_i \le V_i, \\
(V_i, +\infty) & \text{if } X_i > V_i,
\end{cases}
\tag{C.1}
$$

for $i = 1, 2$. Under this two-dimensional interval censorship model, data are always interval censored, i.e., $L_i < R_i$ with probability one. If we allow the possibility of exact observations in the data, so that

$$
L_i = R_i = X_i, \tag{C.2}
$$

then (C.1) and (C.2) together define a two-dimensional mixed interval censorship model.
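The censoring rule (C.1) can be sketched in a few lines of Python. This is only an illustration: the function names `censor` and `censor_pair` are hypothetical, and the treatment of the boundary cases (an event exactly at a follow-up time is assigned to the earlier interval) is an assumption chosen to agree with the half-open intervals $[0, U_i]$, $(U_i, V_i]$, $(V_i, +\infty)$ used below.

```python
import math


def censor(x, u, v):
    """Return the observable pair (L, R) for one coordinate under (C.1).

    x is the unobserved event time; u < v are two consecutive follow-up times.
    Assumption: an event exactly at a follow-up time falls in the earlier interval.
    """
    if not 0 < u < v:
        raise ValueError("follow-up times must satisfy 0 < u < v")
    if x <= u:
        return (0.0, u)
    if x <= v:
        return (u, v)
    return (v, math.inf)


def censor_pair(x, u, v):
    """Apply (C.1) coordinate-wise to a bivariate event time x = (x1, x2)."""
    (l1, r1) = censor(x[0], u[0], v[0])
    (l2, r2) = censor(x[1], u[1], v[1])
    return (l1, r1, l2, r2)
```

For instance, an event time of (2.5, 7.0) with follow-up times (1, 3) in the first coordinate and (2, 5) in the second is observed only as the rectangle with endpoints (1, 3, 5, +infinity).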

Let $B_i$ denote any one of $[0, U_i]$, $(U_i, V_i]$ and $(V_i, +\infty)$. A bivariate IC data point is therefore a rectangular region in $\mathbb{R}^2$ taking one of the nine forms in $\mathcal{B} = \{B_k \times B_l : k, l = 1, 2, 3\}$. Given a sample of size $n$, the observations $(L_{i1}, R_{i1}, L_{i2}, R_{i2})$ can be represented by rectangles $I_i \in \mathcal{B}$, for $i = 1, \ldots, n$. Define a maximal intersection (MI) $A$ of the observable rectangles $I_1, \ldots, I_n$ to be a nonempty finite intersection of the $I_i$'s such that $A \cap I_i = \emptyset$ or $A$, for each $i$. Let $A_1, \ldots, A_m$ denote the distinct maximal intersections with respect to $I_1, \ldots, I_n$.
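To make the definition of a maximal intersection concrete, the following Python sketch enumerates the MIs of a small set of rectangles. It is not the algorithm implemented for this project, only a simple grid-based illustration: the function name `maximal_intersections` is hypothetical, each rectangle is assumed to be half-open, $(l_1, r_1] \times (l_2, r_2]$, and the approach (collect, for every elementary grid cell, the set of rectangles covering it, then keep the cover sets that are maximal under inclusion) is chosen for clarity rather than speed.

```python
from itertools import product


def maximal_intersections(rects):
    """Return the maximal intersections of half-open rectangles.

    rects: list of (l1, r1, l2, r2), each meaning (l1, r1] x (l2, r2].
    The result is a sorted list of 4-tuples in the same format.
    """
    # Distinct endpoints in each coordinate define a grid of elementary cells.
    xs = sorted({e for l1, r1, _, _ in rects for e in (l1, r1)})
    ys = sorted({e for _, _, l2, r2 in rects for e in (l2, r2)})

    # For every cell (a, b] x (c, d], record which rectangles contain it.
    covers = set()
    for (a, b), (c, d) in product(zip(xs, xs[1:]), zip(ys, ys[1:])):
        cover = frozenset(
            i
            for i, (l1, r1, l2, r2) in enumerate(rects)
            if l1 <= a and b <= r1 and l2 <= c and d <= r2
        )
        if cover:
            covers.add(cover)

    # A cover set that is not a proper subset of another one yields an MI.
    maximal = [k for k in covers if not any(k < other for other in covers)]
    result = [
        (
            max(rects[i][0] for i in k),
            min(rects[i][1] for i in k),
            max(rects[i][2] for i in k),
            min(rects[i][3] for i in k),
        )
        for k in maximal
    ]
    return sorted(result)


# The four observations of Example 1, given later in this section.
example_rects = [(1, 6, 1, 3), (1, 6, 4, 6), (1, 3, 1, 6), (4, 6, 1, 6)]
example_mis = maximal_intersections(example_rects)
```

Applied to the four observations of Example 1, the sketch returns the four rectangles $(1,3] \times (1,3]$, $(1,3] \times (4,6]$, $(4,6] \times (1,3]$ and $(4,6] \times (4,6]$ listed there.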

The generalized likelihood function of $S$ is given by $\Lambda_n = \mu_S(I_1) \times \cdots \times \mu_S(I_n)$, where $\mu_S(\cdot)$ is the probability measure induced by $S$. Wong and Yu [2] show that the GMLE $\hat{S}$, which maximizes $\Lambda_n$, must assign all the probability masses $s_1, \ldots, s_m$ to $A_1, \ldots, A_m$. In general, $\hat{S}$ has to be obtained iteratively. Since $\hat{S}$ is also a self-consistent estimate (SCE), we can implement the SCE algorithm by solving for $s_1, \ldots, s_m$ in

$$
s_j = \frac{1}{n} \sum_{i=1}^{n} \frac{\delta_{ij} s_j}{\sum_{k=1}^{m} \delta_{ik} s_k},
\tag{C.3}
$$

$j = 1, \ldots, m$, where $\delta_{ij} = \mathbf{1}[A_j \subset I_i]$, $\mathbf{1}[\cdot]$ denoting the indicator function, and obtain an SCE of $S(x)$,

$$
\hat{S}(x) = \sum_{A_j \subset (x_1, +\infty) \times \cdots \times (x_d, +\infty)} s_j .
$$

With starting values $s_j = 1/m$ for all $j$, $\hat{S}(x)$ is the GMLE at convergence.
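The self-consistency iteration described above can be sketched as follows. This is a minimal illustration, not the program installed at the project site: the names `self_consistent_masses` and `survival`, the stopping rule, and the tolerance are all assumptions, and the MIs are assumed to be half-open rectangles $(l_1, r_1] \times (l_2, r_2]$ stored as 4-tuples.

```python
def self_consistent_masses(delta, tol=1e-10, max_iter=10_000):
    """Iterate the self-consistency equation from the start s_j = 1/m.

    delta[i][j] is 1 if MI A_j is contained in rectangle I_i, else 0.
    Each pass applies s_j <- (1/n) * sum_i delta_ij * s_j / sum_k delta_ik * s_k.
    """
    n, m = len(delta), len(delta[0])
    s = [1.0 / m] * m
    for _ in range(max_iter):
        # Current probability assigned to each observed rectangle.
        denom = [sum(d * sk for d, sk in zip(row, s)) for row in delta]
        new = [
            sum(delta[i][j] * s[j] / denom[i] for i in range(n)) / n
            for j in range(m)
        ]
        # Assumed stopping rule: largest change in any mass below tol.
        if max(abs(a - b) for a, b in zip(new, s)) < tol:
            return new
        s = new
    return s


def survival(x, mis, s):
    """Evaluate the estimate at x = (x1, x2): total mass of the MIs
    (l1, r1] x (l2, r2] that lie inside (x1, +inf) x (x2, +inf)."""
    return sum(
        sj for (l1, _, l2, _), sj in zip(mis, s) if l1 >= x[0] and l2 >= x[1]
    )
```

With three observations and two MIs, where the first two observations contain only $A_1$ and the third contains only $A_2$, the iteration settles at masses 2/3 and 1/3 after a single pass.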

We have implemented an algorithm to identify the MIs corresponding to a set of rectangles $I_1, \ldots, I_n$, and a computer program to calculate the GMLE iteratively. The programs are installed at the internet site math.binghamton.edu/ftp/pub/qyu. This completes Task 1. We have established uniform consistency of $\hat{S}$ by proving

$$
\Pr\Big\{ \lim_{n \to \infty} \sup_{x \text{ is observable}} |\hat{S}(x) - S(x)| = 0 \Big\} = 1
$$

under condition C1 (Task 2a) and under condition C2 (Task 2b):

C1. The censoring vectors $(U_1, V_1)$ and $(U_2, V_2)$ take on countably many values.

C2. The censoring distribution $G$ of $(U, V)$ is continuous, and some regularity assumptions are imposed on either $S$ or $G$.

The consistency results above, which we obtained in our first two years of research, are published in a peer-reviewed statistical journal ([2]) and reported in the Ph.D. thesis [9] of Dr. Shaohua Yu, written under the supervision of Professor Qiqing Yu.

Asymptotic normality results are fundamentally important for confidence statements and hypothesis testing in data analysis. We have proved that $\sqrt{n}\,(\hat{S}(x) - S(x))/\sigma \xrightarrow{D} N(0, 1)$, where $\sigma^2$ is the inverse of the observed Fisher information number, or equivalently, $\hat{S}$ is both asymptotically normal and asymptotically efficient under condition D1 (Tasks 3a,b):

D1. $(U_1, V_1)$ and $(U_2, V_2)$ take on finitely many values, say $a_1, \ldots, a_N$, and $S(a_k) > S(a_l)$ if $a_{k1} \le a_{l1}$ and $a_{k2} \le a_{l2}$ with at least one strict inequality, where $a_k = (a_{k1}, a_{k2})$ and $a_l = (a_{l1}, a_{l2})$.

The  above  asymptotic  normality  results  that  we  have  accomplished  in  our  first  two 
years  of  research  are  published  in  a  peer-reviewed  statistical  journal  ([2]). 

Our research effort for the second year focused on the asymptotic normality of the GMLE under conditions D2 and D3 (Tasks 3c,d,e,f):

D2. $S$ is arbitrary, $(U_1, V_1)$ and $(U_2, V_2)$ take on countably many values, and the strict monotonicity condition in D1 holds.

D3. $S$ is arbitrary, $G$ is continuous, and either $S$ or $G$ meets some reasonable smoothness and regularity conditions.

The  following  are  results  we  have  established  in  the  second  year  of  our  research: 

1. If there are no exact observations in the data, then asymptotic normality of the GMLE does not hold under assumption D3. If the GMLE had an asymptotically normal distribution, then its marginal asymptotic distribution would also have to be a univariate normal distribution. However, Groeneboom and Wellner [3] have shown that if the underlying distribution functions are continuous, then the univariate GMLE does not have an asymptotically normal distribution. Thus the GMLE of $S$ with multivariate interval-censored data cannot be asymptotically normal under condition D3.

2. If there exist exact observations in the data, then under assumption D2 or D3, the GMLE of $S$ is asymptotically normal and efficient under the mixed interval censorship model. A manuscript that summarizes the asymptotic normality under assumptions D2 or D3 for such a model is being prepared.

We have also worked on Task 4. In particular, we have studied the large-sample properties of the weighted Kaplan-Meier test statistic given by

$$
D = \int_{x_1 > 0} \int_{x_2 > 0} W(x) \big( \hat{S}_A(x) - \hat{S}_B(x) \big) \, dx ,
$$

where $W(\cdot)$ is a given weight function, and $A$ and $B$ refer to two comparison conditions. Under condition D1, we have established consistency and asymptotic normality of the statistic $D$ (Task 4a).

When $W(x) = 1$, and $A$ and $B$ represent two independent samples, then

$$
D = \int_{x > 0} \hat{S}_A(x) \, dx - \int_{x > 0} \hat{S}_B(x) \, dx, \qquad
\mathrm{Var}(D) = \mathrm{Var}\Big( \int_{x > 0} \hat{S}_A(x) \, dx \Big) + \mathrm{Var}\Big( \int_{x > 0} \hat{S}_B(x) \, dx \Big).
$$

A consistent estimator of $\mathrm{Var}(D)$ can easily be derived, and the P-value of $D$ can be computed. We have also studied other weight functions $W(\cdot)$ and the case in which $A$ and $B$ are not independent. The derivation is not as simple and will not be discussed here. A manuscript that derives the asymptotic distribution of $D$ when $W(x) \ne 1$ or the sets $A$ and $B$ are not independent is under preparation (Tasks 4b,c).
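For the independent-sample case with $W(x) = 1$, the P-value computation can be sketched as below. The report does not spell out the test, so this is an assumption-laden illustration: it presumes that $D$ is approximately normal with mean zero under the null hypothesis, that consistent variance estimates for the two integrals are already available, and that a two-sided test is wanted. The function name `two_sample_z_test` is hypothetical.

```python
import math


def two_sample_z_test(d, var_a, var_b):
    """Normal-approximation test for D with independent samples A and B.

    d     : observed value of the statistic D
    var_a : estimated variance of the integral of the estimate for A
    var_b : estimated variance of the integral of the estimate for B
    Returns (z, p) where p is a two-sided P-value (an assumed choice).
    """
    if var_a < 0 or var_b < 0 or var_a + var_b == 0:
        raise ValueError("variance estimates must be non-negative, not both zero")
    # Var(D) is the sum of the two variances because the samples are independent.
    z = d / math.sqrt(var_a + var_b)
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

For example, `two_sample_z_test(1.96, 0.5, 0.5)` gives a standardized statistic of 1.96 and a P-value close to 0.05.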

Furthermore, we have encountered a non-uniqueness problem in the GML inferences that we had not expected when we submitted our proposal: the GMLE of $S$ for some multivariate interval-censored data is not unique. We point out that the GMLE solution for univariate interval-censored data is always unique. As a consequence of the non-uniqueness, the sample information matrix

$$
J_{\hat{s}} = - \left( \frac{\partial^2 \log \Lambda_n}{\partial s_i \, \partial s_j} \right) \bigg|_{s = \hat{s}} ,
$$

where $\Lambda_n$ is the generalized likelihood function, is singular and its inverse does not exist. Since the variance estimation of the GMLE is based on the inverse of $J_{\hat{s}}$, it is therefore important to resolve the non-uniqueness problem.

The program for computing a GMLE of $S$ that we completed in the first year of our project is still applicable, even if there are multiple solutions. However, if there are multiple solutions, the program cannot provide an estimator of the variance of the GMLE and thus cannot provide confidence intervals. We present an artificial bivariate data set that gives rise to non-uniqueness of the GMLE solution.

Example 1. Suppose that a sample of size 4 consists of observations $(L_{i1}, R_{i1}, L_{i2}, R_{i2})$, $i = 1, \ldots, 4$, which equal $(1, 6, 1, 3)$, $(1, 6, 4, 6)$, $(1, 3, 1, 6)$ and $(4, 6, 1, 6)$, respectively. Then the MIs are $A_1 = (1, 3] \times (1, 3]$, $A_2 = (1, 3] \times (4, 6]$, $A_3 = (4, 6] \times (1, 3]$ and $A_4 = (4, 6] \times (4, 6]$. The mass vectors $\hat{s}_q = q\,(1/2, 0, 0, 1/2) + (1 - q)\,(0, 1/2, 1/2, 0)$, $q \in (0, 1)$, assigned to $(A_1, \ldots, A_4)$ are all GMLEs of $S$. The sample information matrix $J_{\hat{s}}$ is

$$
\begin{pmatrix}
a_{12} + a_{13} + a_{24} + a_{34} & a_{12} + a_{34} & a_{13} + a_{24} \\
a_{12} + a_{34} & a_{12} + a_{34} & 0 \\
a_{13} + a_{24} & 0 & a_{13} + a_{24}
\end{pmatrix},
$$

where $a_{ij} = (\hat{s}_i + \hat{s}_j)^{-2}$. Note that the first column of the matrix is the sum of the next two columns. Consequently, $J_{\hat{s}}$ is singular, and the inverse matrix of $J_{\hat{s}}$ does not exist.
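The claims of Example 1 can be checked numerically. The sketch below assumes, from the rectangles and MIs listed in the example, that the four observations have probabilities $s_1 + s_3$, $s_2 + s_4$, $s_1 + s_2$ and $s_3 + s_4$; the helper names (`likelihood`, `s_q`, `information_matrix`, `det3`) are hypothetical. It shows that every $\hat{s}_q$ attains the same likelihood and that the $3 \times 3$ information matrix has zero determinant.

```python
def likelihood(s):
    """Generalized likelihood of Example 1 as a function of the four masses."""
    s1, s2, s3, s4 = s
    return (s1 + s3) * (s2 + s4) * (s1 + s2) * (s3 + s4)


def s_q(q):
    """The family q*(1/2, 0, 0, 1/2) + (1 - q)*(0, 1/2, 1/2, 0)."""
    return (q / 2, (1 - q) / 2, (1 - q) / 2, q / 2)


def information_matrix(s):
    """Sample information matrix of Example 1, with a_ij = (s_i + s_j)^(-2)."""
    def a(i, j):
        return (s[i - 1] + s[j - 1]) ** -2

    a12, a13, a24, a34 = a(1, 2), a(1, 3), a(2, 4), a(3, 4)
    return [
        [a12 + a13 + a24 + a34, a12 + a34, a13 + a24],
        [a12 + a34, a12 + a34, 0.0],
        [a13 + a24, 0.0, a13 + a24],
    ]


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
```

Evaluating `likelihood(s_q(q))` for several values of `q` gives 1/16 each time, and `det3(information_matrix(s_q(q)))` is zero up to rounding, which is the singularity described above.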
We  have  proposed  a  method  to  resolve  the  non-uniqueness  problem.  By  proving  a  result
in  linear  algebra,  we  have  proposed  to  estimate  S  by  a  special  GMLE,  and  to  estimate  its 
covariance  matrix  by  a  certain  procedure.  There  are  four  main  steps: 

1. Using the self-consistent algorithm, we first find a GMLE of $F$, denoted by $\hat{F}$.

2. We established in [8] that each solution to the system of equations

$$
\sum_{j=1}^{m} \delta_{ij} s_j = \mu_{\hat{F}}(I_i), \quad i = 1, \ldots, n, \qquad s_j \ge 0, \qquad \sum_{j=1}^{m} s_j = 1,
\tag{C.4}
$$

is a GMLE of $s$, where $\mu_{\hat{F}}$ is the probability measure induced by $\hat{F}$. The information in (C.4) can be formulated in matrix form as

$$
A s = b. \tag{C.5}
$$

Before we discuss the last two steps, we point out that in general, given an $(n + 1) \times m$ matrix $A$ with rank $r < m - 1$, an $m \times 1$ vector $s$ and an $(n + 1) \times 1$ vector $b$, if the linear equation $As = b$ has a non-zero solution, then the solution is not unique and the solutions can be written in the form

$$
(s_{i_{r+1}}, \ldots, s_{i_m})' = B \, (s_{i_1}, \ldots, s_{i_r})', \qquad s_{i_1}, \ldots, s_{i_r} \in \mathbb{R},
$$

where $B$ is an $(m - r) \times r$ matrix and $(i_1, \ldots, i_m)$ is a permutation of $(1, \ldots, m)$. However, there is no guarantee that

$$
\text{there is a solution that satisfies } s_i \ge 0,\ i = 1, \ldots, m, \text{ and } s_{i_{r+2}} = \cdots = s_{i_m} = 0.
\tag{C.6}
$$

3. We established in [8] that (C.6) holds for equation (C.4) or (C.5) and proposed a procedure to identify the indexes $i_{r+2}, \ldots, i_m$.

4. Then the likelihood function of $s$ with $s_{i_{r+2}} = \cdots = s_{i_m} = 0$ and $\sum_{j=1}^{m} s_j = 1$ will have a non-singular Fisher information matrix. We propose to find a GMLE of $s$ with $s_{i_{r+2}} = \cdots = s_{i_m} = 0$.

For ease of understanding, we illustrate our idea with Example 1 above. Note that a GMLE assigns weight 1/4 to each of the 4 MIs (Step 1). Each solution $s = (s_1, s_2, s_3, s_4)$ to the set of linear equations

$$
s_1 + s_2 = 1/2, \quad s_1 + s_3 = 1/2, \quad s_2 + s_4 = 1/2, \quad s_3 + s_4 = 1/2, \quad \sum_{i=1}^{4} s_i = 1,
$$

is a GMLE of $s$ (Step 2). The equations can be written in the matrix form $As = b$:

$$
\begin{pmatrix}
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1
\end{pmatrix}
\begin{pmatrix} s_1 \\ s_2 \\ s_3 \\ s_4 \end{pmatrix}
=
\begin{pmatrix} 0.5 \\ 0.5 \\ 0.5 \\ 0.5 \\ 1 \end{pmatrix},
$$

where the rank of $A$ is $r = 2$.

For these equations, there exists a solution such that $s_4 = 0$ and $s_i \ge 0$. In fact, the solution $\hat{s} = (0, 0.5, 0.5, 0)$ is another GMLE (Step 3).
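Steps 2 and 3 for Example 1 can be verified with a short check that a candidate mass vector is non-negative and satisfies $As = b$. The sketch below is only a verification aid with hypothetical helper names (`residual`, `is_feasible`); it does not implement the index-selection procedure of [8].

```python
# Coefficient matrix and right-hand side of the linear system for Example 1.
A = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
]
b = [0.5, 0.5, 0.5, 0.5, 1.0]


def residual(s):
    """Largest absolute violation of any equation in A s = b."""
    return max(
        abs(sum(a * x for a, x in zip(row, s)) - bi) for row, bi in zip(A, b)
    )


def is_feasible(s, tol=1e-12):
    """True if s has non-negative entries and solves A s = b up to tol."""
    return all(x >= 0 for x in s) and residual(s) <= tol
```

Both the equal-weight vector (1/4, 1/4, 1/4, 1/4) from Step 1 and the boundary solution (0, 0.5, 0.5, 0) from Step 3 pass this check, whereas a vector such as (1, 0, 0, 0) does not.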

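The likelihood function is given by

$$
\Lambda_n = (s_1 + s_2)(s_1 + s_3)\, s_2 s_3 = (1 - s_3)(1 - s_2)\, s_2 s_3 .
$$

For this choice of $s$, the covariance matrix of $(\hat{s}_2, \hat{s}_3)$ is estimated by the inverse of the matrix

$$
- \left( \frac{\partial^2 \log \Lambda_n}{\partial s_i \, \partial s_j} \right)_{i,j = 2,3} \bigg|_{(s_2, s_3) = (\hat{s}_2, \hat{s}_3)}
=
\begin{pmatrix}
\dfrac{1}{s_2^2} + \dfrac{1}{(1 - s_2)^2} & 0 \\
0 & \dfrac{1}{s_3^2} + \dfrac{1}{(1 - s_3)^2}
\end{pmatrix}
\bigg|_{(s_2, s_3) = (\hat{s}_2, \hat{s}_3)}
\quad \text{(Step 4)}.
$$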
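Step 4 for Example 1 can be worked through numerically. The sketch below takes the restricted likelihood $(1 - s_3)(1 - s_2) s_2 s_3$ stated above, writes down the negative second derivatives of its logarithm, and inverts the resulting diagonal matrix. The function names are hypothetical, and the diagonal entries are derived here from the stated likelihood rather than quoted from [8].

```python
import math


def log_likelihood(s2, s3):
    """Log of the restricted likelihood (1 - s3)(1 - s2) s2 s3 of Example 1."""
    return math.log(1 - s3) + math.log(1 - s2) + math.log(s2) + math.log(s3)


def restricted_information(s2, s3):
    """Negative Hessian of log_likelihood with respect to (s2, s3).

    The cross derivative vanishes because the log-likelihood is a sum of
    terms that each involve only one of the two variables.
    """
    return [
        [1 / s2**2 + 1 / (1 - s2) ** 2, 0.0],
        [0.0, 1 / s3**2 + 1 / (1 - s3) ** 2],
    ]


def covariance_estimate(s2, s3):
    """Inverse of the (diagonal) restricted information matrix."""
    info = restricted_information(s2, s3)
    return [[1 / info[0][0], 0.0], [0.0, 1 / info[1][1]]]
```

At $(\hat{s}_2, \hat{s}_3) = (0.5, 0.5)$ the information matrix is diagonal with both entries equal to 8, so it is invertible and the estimated variance of each mass is 1/8.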


We  have  established  consistency  and  asymptotic  normality  of  this  procedure  under 
certain  regularity  conditions.  Our  research  here  extends  the  requirements  of  Tasks  2  and  3 
in  the  original  proposal.  For  more  details  we  refer  to  [8]. 

D.  KEY  RESEARCH  ACCOMPLISHMENTS  IN  THE  SECOND  YEAR 

•  We  have  completed  most  of  Task  3. 

The GMLE of the distribution function has been studied, and its consistency and asymptotic normality have been established under various assumptions (C1, C2, D1, D2) on the censoring random vectors. Part of our results is published in a peer-reviewed journal (see [2]). The rest will be organized in two manuscripts under preparation.

•  We  have  completed  most  of  Task  4. 

We  have  established  consistency  and  asymptotic  normality  of  the  statistic  D.  The 
results  are  summarized  in  a  manuscript  which  is  under  preparation. 

•  We  have  resolved  the  non-uniqueness  problem  in  the  GML  inferences.  Consistency  and 
asymptotic  normality  (Extensions  of  Tasks  1,  2  and  3)  of  the  procedure  proposed  have 
been  established.  The  result  is  published  in  a  peer-reviewed  journal  (see  [8]). 

•  We have developed computer software packages for implementing the GML inferences when the GMLE with multivariate interval-censored data is not unique. This is an extension of Task 1.

E.  REPORTABLE  OUTCOMES 

•  6 published articles in journals cited in the Science Citation Index: [2], [4], [5], [6], [7], [8].

•  Computer programs installed at math.binghamton.edu/ftp/pub/qyu.
F.  CONCLUSIONS

In the first two years of our DOD grant, we have successfully accomplished our research objectives stated in Tasks 1, 2, 3 and 4. Under the multivariate interval censorship models, we have established consistency, asymptotic normality and asymptotic efficiency of the GMLE under various assumptions. We have solved a non-uniqueness problem that occurs in multivariate interval censoring, but not in univariate interval censoring. Moreover, we have implemented computer programs for carrying out the asymptotic GML procedure.

The results that we have established will be useful to breast cancer researchers pursuing chemoprevention intervention trials involving multiple surrogate endpoint biomarkers, and to genetic epidemiologists conducting studies on familial aggregation of breast cancer and related cancers.